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Abstract. Genome-wide association studies, in which as many as a 
million single nucleotide polymorphisms (SNP) are measured on sev- 
eral thousand samples, are quickly becoming a common type of study 
for identifying genetic factors associated with many phenotypes. There 
is a strong assumption that interactions between SNPs or genes and 
interactions between genes and environmental factors substantially con- 
tribute to the genetic risk of a disease. Identification of such interactions 
could potentially lead to increased understanding about disease mech- 
anisms; drug x gene interactions could have profound applications for 
personalized medicine; strong interaction effects could be beneficial for 
risk prediction models. In this paper we provide an overview of dif- 
ferent approaches to model interactions, emphasizing approaches that 
make specific use of the structure of genetic data, and those that make 
specific modeling assumptions that may (or may not) be reasonable to 
make. We conclude that to identify interactions it is often necessary to 
do some selection of SNPs, for example, based on prior hypothesis or 
marginal significance, but that to identify SNPs that are marginally as- 
sociated with a disease it may also be useful to consider larger numbers 
of interactions. 
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1. INTRODUCTION 

Genome-wide association studies (GWAS), in which 
as many as a million single nucleotide polymorphisms 
(SNP) are measured on several thousand samples, 
are quickly becoming common for identifying ge- 
netic factors associated with many phenotypes. Un- 
til now most analyses of GWAS have taken a one- 
SNP-at-a-time approach, some analyses are employ- 
ing haplotypes, but this is mostly as surrogates for 
unmeasured SNPs. There is, however, the strong as- 
sumption that interactions between SNPs or genes 
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(gene x gene interactions) and interactions between 
genes and environmental factors (gene x environ- 
ment interactions) substantially contribute to the 
genetic risk of a disease (e.g., Prankel and Stork, 
1996; Philips, 2008). Identification of such interac- 
tions could potentially lead to increased understand- 
ing about disease mechanisms; drug x gene interac- 
tions could have profound applications for person- 
alized medicine; strong interaction effects could be 
beneficial for risk prediction models. 

The GWAS that have been carried out to date 
have not lead to the identification of many such in- 
teractions, other than that some of the SNPs, that 
were found to be marginally associated with a dis- 
ease, are sometimes also tested for differences in sub- 
groups (e.g., Gudbjartsson et al., 2007, Table 2). 
There are several reasons for this lack of identifica- 
tion of interactions: 

1. The odds ratios of SNPs that have main effects 
are usually small, and there is no reason to ex- 
pect that interaction effects are any bigger. In- 
creased degrees of freedom for models involving 
both main effects and interaction effects result in 
even lower power to specifically identify the in- 
teractions than the already low power to identify 
main effects. 

2. The potential number of gene x gene interactions 
is very large, for example, ~5 x 10 11 two-SNP in- 
teractions for the 1 million SNP chip, making it 
computationally impossible to do anything more 
than the simplest model for all interactions, and 
reducing the already limited power further be- 
cause of the required multiple comparisons cor- 
rection. 

3. The potential number of gene x environment in- 
teractions is smaller; however, when there are 
several environmental factors of interest the num- 
ber of multiple comparisons for which we have to 
correct is still considerably larger than the num- 
ber of SNPs in the initial scan. In addition, some 
environmental factors may have measurement er- 
ror, further reducing power to identify interac- 
tions involving those factors. 

4. With single SNP models there are a variety of 
(genetic) models to choose from: for example, ad- 
ditive, recessive, dominant and codominant mod- 
els. For interaction models there are many more, 
yielding an unwieldy array of models and ap- 
proaches to choose from, some of which probably 
could be dismissed as not having the power to 



identify interactions from the start. An example 
of such a model with little power is one for kth or- 
der interactions, with k > 2, where all 3 k possible 
combinations of SNPs are modeled separately. 
5. Imputation methods for marginal SNP effects work 
fairly well in identifying disease associated SNPs 
that are unmeasured, although the power for these 
unmeasured SNPs is somewhat lower than for 
measured SNPs; for interactions it becomes infea- 
sible to check all possible interactions involving 
unmeasured SNPs. 

In this paper we will review a number of approaches 
that can be used for finding interactions in GWAS. 
We will use a few guiding principles when using 
these methods: 

• If a main effect fits the data well enough, don't use 
an interaction. This may seem obvious, and many 
established function estimation methods adhere 
to this principle. However, not all models for ge- 
netic interactions adhere to this keep- it-simple prin- 
ciple. 

• Think about which model to use; think how the 
method scales up. This is another reason to keep- 
it-simple. If the model is too computer intensive, 
it may not be feasible to fit it many millions of 
times. 

• If you need to divide the cake, give a slightly larger 
crumb to the people who will enjoy it — that is, 
spend your power on the most likely interactions. 
Good candidates for SNPs involved in gene x 
gene or gene x environment interactions are prior 
hypothesized effects (if there are any) and SNPs 
that show main effects. (Partly) ignoring other 
possible interactions will eliminate the possibil- 
ity of identifying these ignored ones as significant; 
since the power is very small to start out, we may 
as well use it wisely and at least identify the more 
likely interactions. 

• Possibly insignificant interactions may help us to 
identify disease associated genes. If there is some 
difference in genetic effects, looking wisely in sub- 
groups may help you find groups where the effect 
is strongest. 

• Be willing to exploit the genetic structure [e.g., 
linkage (dis)equilibrium, SNPs taking only three 
values], be willing to make some assumptions (e.g., 
gene environment independence) , but be very aware 
what you loose if these assumptions are wrong. 
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Obviously, many of these principles are not that dif- 
ferent from studying interactions in smaller prob- 
lems: the main difference is that the size of the 
problem for logistical reasons forces us to make the 
right choice immediately — we may never get a sec- 
ond chance to correct ourselves. 

Furthermore, the nature of potential models of in- 
teractions has implications both for interpretation 
and statistical power. One class of methods uses 
models leading to traditional odds ratio estimates 
of main effects and interactions. These methods are 
based on multiplicative interactions in a logistic re- 
gression and can be represented more generally as 
tensor product regression spline models. On the other 
hand, another class of strategies address the poten- 
tial for strong associations within subgroups of sub- 
jects; these models include Logic Regression, tree- 
based regression, haplotype analysis (we describe an 
adaptive regression strategy called SHARE in Sec- 
tion 3) or adaptively weighted subgroup analysis. 
We note that many methods which focus on sub- 
group effects should not be interpreted as defini- 
tively describing an interaction in the multiplicative 
sense but rather as tools to increase the potential of 
finding any association between gene and outcome 
or as tools for better risk prediction. We make the 
case, that the choices of interaction strategies which 
are potentially most powerful depend on the setting, 
whether it is gene x gene interaction studies, gene x 
environment studies, dimensionality of the SNP or 
environmental data and hypothesized genetic struc- 
ture. 



The next section starts by giving some general 
background about methods for interaction model- 
ing in data analysis. In Section 3 we discuss some 
methods for interactions that have been developed 
for genetic studies. In Section 3.4 we see how we 
can do some (limited) identification of interactions 
in GWAS, and in Section 4 we see how we can use 
interactions in GWAS to identify SNPs that are 
marginally associated with a disease. We end with 
a brief discussion. 

2. INTERACTIONS IN STATISTICAL 
MODELING 

Interactions are often characterized by departures 
from a simple additive combination of effects in the 
context of some regression model. Such models are 
of interest in a genetic association study, since one 
may like to describe instances where a gene is asso- 
ciated with a disease only in the presence of another 
gene or in combination with an environmental fac- 
tor. Alternatively, modeling more complex models 
including interactions can improve function approx- 
imations to derived better risk prediction models. 

The "Simplest" Interaction Model 

To start the discussion, consider a simple inter- 
action model involving two variables. A logit model 
for a binary disease phenotype Y £ {0, 1} is 

logit[P(Y = 1|X)] 

(1) 

= /3 + /?iX 1 + /? 2 X 2 + /? 3 X 1 X 2 , 




variable 1 variable 1 



Fig. 1. Two different scenarios of interaction effects. In the left panel disease risk is always higher in the {aa, Aa} group; 
in the right panel the disease risk actually is reversed for variable level Zl versus Z2 and Z3. 
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where X\ and X% may code genes (SNPs or hap- 
lotypes). The final term in the model expresses a 
departure from a simple additive model, at least on 
the logit scale. In the setting of genomic associa- 
tion studies, interaction models can also include two 
classes of variables, for instance, genetic and envi- 
ronmental factors. In that case the model can be 
represented as 

logit = i|x, z)] = Pq + p x z + p 2 X! + p 3 zx u 

where Z denotes an environmental variable or some 
other subject exposure variable, such as assigned 
treatment. The classic schema of a two variable in- 
teraction is given in Figure 1 which shows the odds 
of disease as a function of two variables. It shows two 
cases, one where the effect of the second variable is 
only evident (or largely evident) within a subset of 
levels of the first variable and a second case, where 
the effect is actually reversed at one level of the sec- 
ond variable. Sometimes this second case is called a 
qualitative interaction, because unlike the first one, 
it does not depend on the choice of the link func- 
tion to the outcome. Again, the figures would apply 
to both gene x gene and gene x environment in- 
teractions. Most would expect the effects shown in 
the left panel to be the most plausible for gene by 
environmental effects in GWAS; that is, effect mod- 
ification, but not effect reversal. A modest twist for 
GWAS is that one could envisage 100s of thousands 
to millions of such interaction plots. 

However, the simple model and Figure 1 hide im- 
portant aspects to the complexity of actually con- 
structing statistical interaction models in several 
ways: (1) the variables that are involved in the in- 
teraction may only be observable as a derived func- 
tion of several variables; that is, in the models given 
above, the terms X, or Z may represent paramet- 
ric or nonparametric functions of several other vari- 
ables, for instance, several SNPs representing a gene 
(2) the system of predictors could be very high di- 
mensional with respect to variable X (for instance, 
the many SNPs or genes in GWAS) and (3) even 
after selecting variables, there are other problem 
specific issues like phasing, measurement error and 
missingness. For point 1 and more recently for point 
2, there is extensive statistical literature addressing 
function approximation and prediction modeling to 
draw upon and potentially use, at least in concept, 
for the analysis of GWAS. 



Additive Expansion Interaction Models 

The simple interaction model above can be ex- 
tended to a broader class of models for nonlinear, 
nonadditive, multivariate regression methods. As- 
sume that the disease model indexed by regression 
function r/(X) is in some if-dimensional linear space 
B(X), 

K 

(2) ,j(X) = 2><fc(X) 

i=l 

for a given set of basis functions ffi(X), . . . ,g p (X). 
Several nonparametric multivariate regression 
methodologies use this additive expansion (often 
called a basis function expansion). We review three 
common function approximation methods that ex- 
press the basis functions as tensor products of indi- 
vidual covariates. While these methods should prob- 
ably not be directly applied to whole genome data 
for both computational and statistical reasons, they 
can be useful after selecting smaller subsets of vari- 
ables. More importantly, they follow the common 
and important paradigm, useful for modeling inter- 
actions from GWAS, which involves searching for 
models sequentially by first identifying main or (lo- 
cally) marginal effects before fitting higher order terms. 

(a) Logistic Regression. Presented with more than 
two predictor combinations, several variables may 
yield better predictions of the outcome variable. As- 
suming modeling disease probability is a goal, the 
expansion model (2) would have component func- 
tions <?i(X) which include products of two or more 
predictors, for example, <?i(X) = XjX^. An example 
model with at most 2-way interactions is 

p p 
r?(X) = /3 + fa X i + E PjkXjXk, 

3 j<k 

where 77 (X) may represent the conditional logit of 
the probability of disease. Even with this simple 
model form, the number of potential interaction mod- 
els is order p 2 . If one does not limit the interactions 
to only involve pairs of variables, there are order 
2 P models. The numbers quickly rise with high or- 
der interactions and numbers of variables. There- 
fore, even if we keep the model sparse with only a 
few f3j and (3jk not equal to zero, one would an- 
ticipate advantages both in computation and vari- 
ance control by constructing a reasonable pathway 
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though the model space. A standard approach is to 
use forward variable selection and consider adding 
interaction terms if one or more of the variables is 
already included in the model as a main effect term. 
The above model is often not sufficiently flexible for 
prediction modeling especially if Xj are multilevel or 
continuous. We consider some alternatives below. 

(b) Regression Spline Methods. One possibility is 
that the <7i(X) in the expansion model (2) are tensor 
products of piecewise linear splines. Some examples 
of methods using such an expansion are Multivari- 
ate Adaptive Regression Splines (MARS, Friedman, 
1991) and related spline methods Hazard Regres- 
sion HARE and Poly MARS (e.g., Kooperberg et al., 
1997). 

Regression spline algorithms exploit lower marginal 
or additive structure to guide the search for inter- 
actions In addition, these methods use basis func- 
tions that are tensor products of basis functions in 
one dimension. For example, if ffi(X) = bi{Xk) and 
<?2(X) = 62 (X[) are two basis functions that depend 
on a single predictor, the 53 (X) = b\ {X k ) 62 {X{) is a 
tensor product basis function that depends on two 
predictors. Truncated linear basis splines are used 
to deal with continuous or ordered covariates {X{ — 
t ki ) + = {Xi - t ki )I{Xi > t ki }. Given that SNP data 
only has only 3 categories, piecewise linear compo- 
nents are probably most useful with respect to envi- 
ronmental factors in gene x environment interaction 
models. Typically interactions of variables are in- 
cluded only if one or both of the variables are already 
identified as single variable terms [e.g., b\ (X k ) or 
&2PQ)] described above. This strategy yields more 
interpretable models since the models contain main 
effects and it also limits the search over the num- 
ber of possible models, which better controls vari- 
ance compared to a search that evaluates all ten- 
sor products. The exact restrictions on when tensor 
product basis functions are allowed in spline models 
differs from one methodology to the other: for exam- 
ple, MARS (Friedman, 1991) has fewer restrictions 
than HARE (Kooperberg et al., 1995) and Polyclass 
and PolyMARS (Kooperberg et al., 1997). All these 
methods identify lower order effects first to control 
the search for higher order interaction terms and 
hence control the variability of the search. 

When building more complex models, one may 
also introduce regularization or penalization to re- 
duce the impact of including too many parameters 
or cells with very small counts of observations. Park 



and Hastie (2008) develop an algorithm that con- 
structs penalized regression models for detecting gene 
environmental interaction models which only includes 
interactions if one of the main effects are in the 
model. Three level SNP variables are coded as dummy 
variables and it uses quadratic penalization to sta- 
bilize the estimation. 

(c) Tree Regression. Regression trees are flexible 
methods capable of capturing interactions by re- 
cursively selecting and partitioning data based on 
low order associations. For tree based methods, the 
<7i(X) in the expansion model (2) are indicator func- 
tions corresponding to rectangular regions, Rf L , of 
the predictor space, gn(X) = I{X € Rh}- The best 
known example in the statistical literature is Clas- 
sification and Regression Trees (CART, Breiman et 
al., 1984). Therefore, tree models can be represented 
as a binary tree T, where the set of terminal nodes 
T corresponds to the partition of the covariate space 
into disjoint subsets. A tree model can also be ex- 
pressed by a basis function representation 

^(X) = ^rj h g h (X), 

where <#i(X) is the region corresponding to a ter- 
minal node h. This function is a tensor product 
g h (X) = I{Xi € S!}---{X p € S p }. To control the 
amount of computation, and to construct a lim- 
ited path through the large class of potential tensor 
product interaction models of this form, the model 
is grown in a forward stepwise fashion, similar to 
stepwise regression. The assumption used by tree 
regression is that effects can be found by searching 
for a local marginal association with outcome. The 
method is applied to the entire data set and pre- 
dictor space, each variable and potential split point 
is evaluated. Of course if the predictors were just 
SNPs with coding {0, 1,2}, then only two splits are 
possible: one corresponding to a recessive effect and 
one corresponding to a dominant effect. The split 
point and variable that leads to the "best" split (as 
described below) is chosen. The data and the pre- 
dictor space are partitioned into two groups. The 
same algorithm is then recursively applied to each 
of the resulting groups. Therefore, at any point on 
the regression tree, a split at a node h yields two 
nodes which can also be represented with the pair 
of basis functions 

b hj ( X ) = H x h(j) G S h(j) } and 

= I i x h(j) i s h(j),}, 
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where S^fj) is a subset of the values of X h ^, lead- 
ing to terminal nodes basis functions gh{X) which 
products of such indicator functions built up at each 
step. Typically a large tree is grown to avoid miss- 
ing structure and then pruned back: model complex- 
ity is reduced by constructing a backward sequence 
of models using the cost-complexity pruning algo- 
rithm. 

(d) Other Expansion Models. Of course, other in- 
teraction outside the tensor product form of individ- 
ual predictors can be expressed in this general form 
using parametric or nonparametric smooth based 
component functions. For instance, multilayer neu- 
ral networks construct the 5i(X) as composite func- 
tion of functions of a linear combinations of sub- 
sets of the predictors <7i(X) = (f>(Yli=i a iX-i)- The 
regression tree methods above can also be extended 
to indicator functions based on linear combinations 
{^2 ctiXi > c}. One implementation, Flextree (Huang 
et al., 2004), uses this model form. 

It is clear that with more than a modest number of 
predictors the potential number of interaction mod- 
els is huge and, hence, variance control is critical in 
the model search. 

Bias/Variance Trade-Off: Picking Model 
Complexity 

Stepwise logistic regression, tree-based methods 
and adaptive regression splines use a forward se- 
lection strategy. A final model can be selected to 
minimize a penalized measure of error, 

-l a = -l(M,P;Yi,Xi,i = l,...,n) + a\M\, 

where l(A4, /3; Yi, Xj, i = l,...,n) is the fitted log- 
likelihood for a model (of dimension \M\) that was 
considered, and a is a penalty parameter. Small 
penalty parameters would lead to large models with 
limited bias, but potentially high variance; larger 
penalty parameters lead to the selection of models 
biased toward the null model, but with less vari- 
ance. Given sufficient computation time, a model 
that minimizes the negative of the cross-validated 
likelihood (or some resampling analog) may be the 
preferred method to address the bias versus variance 
trade-off. 

The above penalty only involves the number of 
parameters; often other common penalties can be 
useful. Additional terms that penalize either the L 1 
norm of the coefficients (e.g., LASSO, Tibshirani, 
1996) or the L 2 norm of the coefficients (Ridge Re- 
gression, Hoerl and Kennard, 1970) or a linear com- 
bination of the two penalty terms (e.g., Elastic Net, 



Zou and Hastie, 2005) can lead to additional effec- 
tive ways to control variance. 

While these are flexible statistical strategies for 
interaction model building and regularization, they 
do not directly incorporate any genomic structure of 
the predictors. As we describe in the next section, 
improved inferences, including improved power for 
testing, can be obtained by incorporating the special 
form of trinary SNP data 0,1,2, the nature of de- 
pendence and/or independence between SNPs and 
environmental variables, and haplotype structures. 

3. MODELS FOR INTERACTIONS IN 
GENETIC STUDIES 

Genetic data have a number of special features 
that can be exploited in modeling interactions. In 
this section we discuss some of those special features 
of genetic data, and how they have been used in 
modeling interactions. Most of these methods can- 
not deal with all SNPs as are commonly measured 
in GWAS simultaneously. However, they can be di- 
rectly used for targeted regions based on either prior 
biological hypothesis or top hits from an initial single- 
SNP filtering, as discussed in Section 3.4. It is hard 
to put the size restrictions of the various methods 
on one scale. For example, the SHARE method, dis- 
cussed in Section 3.2, is intended to find interac- 
tions within a block of SNPs in linkage disequilib- 
rium, and would thus be applied to a fairly small 
number of SNPs, for instance, 50 tag SNPs between 
recombination hotspots. Nevertheless, for SHARE 
it is straightforward to apply it to a GWAS using 
a "sliding window" approach, where the method is 
applied to overlapping blocks of SNPs that are close 
to each other in the genome. 

On the other hand, methods like Multifactor Di- 
mensionality Reduction and Logic Regression, dis- 
cussed in Section 3.1, are intended to find long-range 
interactions between a smaller number of SNPs. These 
methods do not scale up to complete GWAS. How- 
ever, they could be applied to subsets of SNPs, like 
candidate gene studies, SNPs in a particular path- 
way or SNPs that attain a certain (marginal) signifi- 
cance level. The number of SNPs that these methods 
can deal with is typically up to a few hundred, or 
maybe a thousand, obviously depending on the sam- 
ple size (so that the methods have enough power) 
and the available computing resources (so that suf- 
ficiently many models can be examined in a rea- 
sonable time). These limitations are quite under- 
standable if we just consider the number of SNPs 
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measured in a GWAS. With, say, one million SNPs, 
there are 5 x 10 11 two-SNP combinations. So, even 
examining the simplest interaction model for each 
combination of SNPs is expensive. 

These size restrictions are much less severe for 
identifying gene x environment interactions. Again, 
this becomes clear from examining the scale of the 
problem. Typically we will only be interested in a 
few environmental factors, thus, the number of po- 
tential single SNP x environment interactions is 
smaller than the potential number of SNP x SNP 
interactions. Thus, some of the ideas on how to iden- 
tify gene x environment interactions discussed in 
Section 3.3 are directly applicable to GWAS. 

To find gene x gene interactions in a GWAS, we 
have to take a much simpler approach, for example, 
the two-stage approach discussed in Section 3.4. 

3.1 Genetic Data Is "Almost Binary" 

Humans carry two copies of each chromosome, 
and most genetic data comes from typing of Sin- 
gle Nucleotide Polymorphisms (SNPs). SNP data is 
commonly coded as 0/1/2, indicating the number of 
minor alleles at a particular locus. If this locus has a 
dominant effect on a disease phenotype, the genetic 
factor X can be coded 1 if the SNP is 1 or 2, and 
otherwise, and if this locus has a recessive effect 
on a disease phenotype, the genetic factor X can be 
coded 1 if the SNP is 2, and otherwise. Dealing 
with binary data is attractive, as models are often 
easier to interpret, and many computations can be 
done more efficiently. 

This binary coding of genetic data is especially ex- 
ploited in Logic Regression (Ruczinski et al., 2003). 
The logic regression model is 

m 

g[E(Y\X, Z)] = A) + ^ Pj L j + E Pk+ m Z k , 
j=i k 

where Y is the disease response, X a vector of re- 
coded SNPs, Z a vector of other (environmental) 
covariates, g(-) is a link function, such as the logit, 
and the Lj are binary combinations of the X, such 
as 

((Ai and X%) or X 3 ). 

The Lj can be interpreted as risk factors. In Logic 
Regression model selection is carried out using per- 
mutation tests and cross-validation. In particular, 
conditional permutation tests are used to select sim- 
pler models when those fit the data. An alterna- 
tive approach is to sample Logic Regression models 



using Markov chain Monte Carlo (Kooperberg and 
Ruczinski, 2005) or bagging (Schwender and Ick- 
stadt, 2007). The search among candidate models 
is carried out using a stochastic simulated anneal- 
ing algorithm, though if the number of SNPs (X) is 
limited and the maximum number of SNPs in each 
Lj were limited to, say, 3, all models could be enu- 
merated. 

Logic Regression can be used to find gene x gene 
interactions, and gene x environment interactions 
for binary environmental variables. For example, in 
Kooperberg et al. (2007), for an analysis of a can- 
didate gene study of cardiovascular disease among 
hypertensive, Logic Regression was used to identify 
the model 

logit [^(myocardial infarction) 

AGTR2 SNPs, hypertensive drugs)] 

= -0.90 - 0.72[(> 1 A allele at rsl71231429) 

and (no calcium channel 

blockers)]. 

As most methods described in this section, Logic 
Regression does not scale up well to the size of GWAS 
and needs some selection of SNPs. The stochastic 
search algorithm could not possibly examine a suffi- 
ciently large number of models when there are hun- 
dreds of thousands of SNPs. Permutation tests or 
cross-validation are even more prohibitive. 

While Logic Regression reduces SNPs from a 0/1/2 
variable to a binary variable, multifactor dimension- 
ality reduction (MDR, Ritchie et al., 2001) makes 
use of the 0/1/2 nature of the SNP data. For two 
specific SNPs MDR divides the nine combinations 
into those that are associated with high and low 
risk for a particular outcome. Thus, an MDR model 
for two SNPs may be 

SNP A 








1 


2 





H 


H 


L 


SNP B 1 


L 


H 


H 


2 


L 


H 


L 


where H and L refer to 


righ 


anc 


low 



low risk, respec- 
tively. Three-level interactions are modeled using all 
27 possible combinations of three SNPs, and so on. 
Among all interactions up to a certain level, MDR 
chooses the best combination by cross-validation. 
Dividing the SNP combinations in high-risk and low- 
risk clearly has a close connection to classification 
trees (see Section 2), but the nonmonotonicity for 
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some SNP combinations (e.g., for SNP B = 2 in 
the example above) makes the method less regu- 
larized and maybe harder to interpret. The cross- 
validation for MDR does not specifically prefer lower 
order models over interactions. Thus, for example, 
a two-level MDR may be chosen, by chance, if in 
fact there are two main effects. This leads to poten- 
tially increasing the type 1 error of incorrectly iden- 
tifying an interaction (but not increasing the type 
1 error of incorrectly identifying a genetic effect). 
Similarly, three SNP interactions may be identified 
when a model with two SNP interactions fits the 
model well. MDR has been applied to a substantial 
number of candidate gene studies. 

3.2 Linkage Disequilibrium 

SNPs that are close to each other on the chromo- 
some are typically highly correlated, because of the 
shared ancestral history. The extent of this correla- 
tion or linkage disequilibrium (LD) is known from 
databases such as the HapMap (HapMap Consotium, 
2007) and the 1000 genomes project (Kaiser, 2008) 
which is currently underway. Known LD structure 
can be used to impute SNPs that are not measured 
(e.g., Servin and Stephens, 2007; Marchini et al., 
2007). LD can also be used to develop multilocus 
association methods, often based on haplotype re- 
constructions (e.g., Browning and Browning, 2007; 
Epstein et al., 2003; Lin and Zeng, 2006). Statisti- 
cally, the main effect of a haplotype can be deemed 
as a combination of main effects and interactions in a 
locus (SNP) regression model (Schaid, 2001). Using 
haplotypes can be an effective way to model interac- 
tions between multiple mutations within a gene. The 
latter perspective may receive an increasing appreci- 
ation as genome-wide sequencing technologies hold 
promise to directly capture the rare variants in the 
near future. Several recent candidate gene studies 
suggested that accumulation of multiple rare alleles 
may contribute to the risk of some common diseases 
(Vermeire et al., 2002; Cohen et al., 2004; Nejent- 
sev et al., 2009). Theoretic population genetic stud- 
ies also support the presence of multiple deleterious 
susceptibility loci with low frequencies (Pritchard, 
2001, Kryukov, Pennacchio and Sunyaev, 2007). To 
this end, a haplotype analysis can be useful to as- 
semble the cumulative, possibly interactive, effect of 
rare variants within a gene. 

Certainly locus-based models, such as the stepwise 
regression method of Cordell and Clayton (2002), 
can be used to model the local interactions directly 



without having to deal with haplotype phase ambi- 
guity. However, with the high density of SNP data 
currently being collected, haplotype phase ambigu- 
ity is less a concern for power. Furthermore, the 
main effect of a haplotype already contains inter- 
actions when casted in locus (SNP)-based models, 
and hence tend to be more parsimonious than locus- 
based models that may require high-order interac- 
tion terms. In situations where the LD is strong, 
additional power gain can be achieved by fewer pa- 
rameters used to characterize the genetic risk pro- 
file. For instance, consider a situation with 2 SNPs 
in complete LD and hence 3 haplotypes (00, 10, 11). 
A logistic regression with additive main effects and 
interactions uses 3 parameters, whereas a logistic 
regression with additive haplotype main effects uses 
one less parameter. With more rare, recent variants 
(with less opportunity of recombination) discovered 
through sequencing, it is anticipated that haplotype- 
based models will be more cost-effective than locus- 
based models in modeling such local interactions. 

When many SNPs in a region (gene) are under in- 
vestigation, the number of haplotypes constructed 
from all SNPs can be excessively large. A haplo- 
type scan using a moving window of 5 to 10 SNPs 
may miss the interactions between SNPs separated 
in the region, since the 3-D structure of a protein 
can bring amino acids further apart into one func- 
tional domain. Strategies to perform model selec- 
tion seem inevitable in order to characterize rele- 
vant genetic variants. Dai et al. (2009) propose the 
SNP-Haplotype Adaptive REgression (SHARE) al- 
gorithm that seeks the most informative set of SNPs 
for genetic association in a targeted region by grow- 
ing and shrinking haplotypes with one more or less 
SNP in a stepwise fashion. Though it is not always 
"optimal," the stepwise selection is a rational choice 
given the computational demand facing a large num- 
ber of SNPs and the fact that haplotypes we observe 
today were formed by sequential (stepwise) muta- 
tions in history. It is hard to imagine that there is 
no marginal effect for a haplotype carrying disease 
risk. Contrary to the popular haplotype clustering 
approaches (e.g., Seltman et al., 2001; Durrant et al., 
2004; Browning and Browning, 2007), in the SHARE 
algorithm both the trait and the genotypes guide 
the model selection process, and the SNP selection 
is irrespective of the order of the SNPs in the region 
(gene) . Cross-validation is used to select the best set 
of SNPs for a haplotype analysis, and phase ambigu- 
ity is accounted for by treating haplotype estimation 
part of the procedure. 
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FlG. 2. Tree illustration of the sequential partition of hap- 
lotypes when 5 SNPs A-E are present. The left panel shows 
the growing set of SNPs used in the analysis and the right 
panel shows the partitions resulted from the haplotypes based 
on the current set of SNPs. The minimal set of SNPs that 
captures the genetic association is (A, D, E), with the disease 
risk concentrated on the haplotype "100". The path leading to 
discovering it is 1 —¥ 10 —¥ 100. The order of SNPs in the 
haplotypes follows A, AD and ADE respectively. 



Despite the resemblance to the stepwise logistic 
regression of Cordell and Clayton (2002), SHARE 
is actually more closely related to the Classification 
and Regression Tree (CART, Breiman et al., 1984) 
algorithm (see Section 2) in that it recursively parti- 
tions the haplotype sample space. Figure 2 shows a 
simple example of the pathway of partition. Among 
5 SNPs in a region, we first find the SNP A as 
the most significant SNP by a single-locus scan, so 
haplotypes are partitioned by "0" and "1" at the 
A locus. Next, haplotypes constructed by A and D 
are found to be the best 2-SNP haplotypes, and fi- 
nally, the algorithm reaches the most informative 
set {A, D, E} so that one 3-SNP haplotype con- 
centrates the disease risk. To increase the chance 
of finding the most informative set, we grow longer 
haplotypes and then prune back one SNP at a time. 
We use cross-validation to select the partition with 
the minimal prediction error. While CART is effec- 
tive to dissect high-order interactions, growing hap- 
lotypes is essentially refining high-order interactions 
between loci. The difference between SHARE and 
CART is that the recursive partitioning of CART 
is binary, while SHARE potentially creates multi- 
ple splits when adding one SNP, if there is recom- 
bination. Moreover, every subject has two sets of 
haplotypes and we need some genetic model to de- 
scribe the combined haplotype effect, such as addi- 
tive, dominant or recessive genetic models. 

3.3 Independence Assumptions 

When it is known that pairs of SNPs or SNPs and 
environmental factors are independent of each other 



in the population, this knowledge can be effectively 
used to identify interactions. Noticing that these in- 
dependent quantities are not independent in a par- 
ticular group of subjects (e.g., cases of a particular 
disease) now implies an interaction effect. This ap- 
proach has been used to develop methods for SNPs 
that are (assumed to be) in linkage equilibrium, and 
for nongenetic environmental factors. 

In randomized clinical trials treatment assignment 
is made independent of the genetic status of the 
participants. In some other situations it may also 
be reasonable to assume gene x environment inde- 
pendence for some environmental exposures. If for 
a case-control study this independence holds for the 
controls (e.g., under the usual rare disease assump- 
tion), it is immediate that a significant correlation 
between a genetic effect and the environmental fac- 
tor among cases implies a gene x environment in- 
teraction on the disease outcome. This is the ba- 
sic idea behind the case-only analysis (Albert et al., 
2001; Piegorsch, Weinberg and Taylor, 1994; Um- 
bach and Weinberg, 1997): there is a simple relation 
between the odds-ratio among the cases and the pa- 
rameter for the interaction in a linear logistic re- 
gression model. Recently a number of methods have 
been proposed to exploit this independence also for 
estimating main effects, and to avoid having to make 
the rare disease assumption (Chatterjee and Carroll, 
2005; Dai et al, 2009). 

It has been pointed out that violation of the gene- 
environment independence assumption can seriously 
increase the type 1 error (Albert et al., 2001). An 
empirical Bayes approach that "averages" the anal- 
ysis assuming independence and a traditional case- 
control analysis has been proposed as a save alter- 
native (Mukherjee and Chatterjee, 2008) that main- 
tains some of the advantage of the independence 
assumption, when that assumption cannot be fully 
confirmed. In this approach the effect estimate un- 
der the (unbiased) case-control design is averaged 
with the (more efficient) estimate using the case only 
analysis, with weights balancing the variance of the 
case-control estimator, and an empirical Bayes es- 
timate of the uncertainty of independence assump- 
tion. 

Clearly, testing whether two factors are indepen- 
dent (e.g., gene and environment among the con- 
trols) and then using a test with or without assum- 
ing independence, depending on the result of the 
independence test, will seriously inflate the type 1 
error. On the other hand, if a preliminary test is 
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independent of the final test for interaction in the 
data set, such a test can be used to prioritize poten- 
tial models that are tested, thereby alleviating the 
multiple comparisons problem. This is the approach 
taken by Millstein et al. (2006). In this paper the au- 
thors first test for (gene-gene) independence in the 
complete data set of cases and controls. Situations 
where there is substantial deviation of independence 
are prioritized for testing for interaction effects. The 
motivation seems to be that if two genes are depen- 
dent, this dependence may very well be different be- 
tween cases and controls, for example, but not neces- 
sary, because the genes are independent among the 
controls but not among the cases. As case-control 
status is not used in this preliminary analysis, the 
eventual analysis for interactions only needs to be 
corrected for the interactions actually tested, which 
increases power considerably. 

Linkage equilibrium In a homogeneous population 
SNPs that are on different chromosomes or far apart 
on the same chromosome are (approximately) inde- 
pendent. This suggests an alternative way to iden- 
tify interactions. For "rare diseases," this population- 
wise independence implies that for controls the SNPs 
should be independent. Simple properties of log- 
linear models show that dependence between two 
SNPs among the cases now implies an interaction 
effect of those SNPs on case-control status. Zhao 
et al. (2006) used this property to develop a test 
for interactions by reconstructing haplotypes (im- 
plicitly assuming Hardy-Weinberg equilibrium) be- 
tween two unlinked loci (SNPs) to get a measure of 
LD between these SNPs. Assuming no LD among 
the controls, they developed a (asymptotically x 2 ) 
test for an interaction effect of these two SNPs on a 
disease outcome. 

In practice, it is unlikely that a population is com- 
pletely homogeneous. Thus, it may be dangerous to 
assume that SNPs are indeed in linkage equilibrium 
among the controls. Given the large number of SNPs 
that are typically tested, a small amount of corre- 
lation will already inflate the type 1 error rate. It 
is, however, a valid test of interaction in genetic as- 
sociation studies to test whether the correlation be- 
tween two SNPs is the same among the cases and 
the controls. In fact, Zhao et al. (2006) also provide 
a test for interaction using this approach. However, 
the examples in their paper that compare their ap- 
proach to logistic regression (which does not use an 
independence assumption) make this independence 
assumption. 



Rajapakse, Perlman and Kooperberg (2009) gen- 
eralizes the approach by Zhao et al. (2006) by con- 
structing a correlation matrix between groups of 
SNPs using the generalized or composite LD of Weir 
(1996), separately for cases and controls. The ad- 
vantage of using the generalized LD over other mea- 
sures of LD is that, since phase information is not 
required, no haplotype reconstruction is needed, and 
under certain conditions the generalized LD between 
two SNPs reduces to the correlation between these 
SNPs when coded as 0/1/2. Using this approach, a 
simple test for identity between the correlation be- 
tween SNPs for the cases and controls becomes a 
test of the interaction effect of these SNPs on case- 
control status. There are a number of advantages 
for this approach, (i) This test of independence can 
easily be extended to blocks of SNPs. An interac- 
tion between two different blocks of SNPs on case- 
control status might suggest that a haplotype, that 
may be a surrogate for an unmeasured SNP in the 
first block, and a haplotype, that may be a surrogate 
for an unmeasured SNP in the second block, have 
an interaction effect on case-control status. There- 
fore, this method is an alternative to the method of 
Chatterjee et al. (2006) discussed in Section 4. (ii) 
A test for identity between the complete correlation 
matrix for the cases and the controls is a global test 
for interactions among the SNPs considered on case- 
control status, (iii) Assumed independence between 
selected SNPs among the controls can easily be in- 
corporated into this procedure by setting elements 
of the correlation matrix for the controls equal to 
0. When only two SNPs are examined, and no in- 
dependence assumption is made, we would expect 
the methods of Zhao et al. (2006) and Rajapakse, 
Perlman and Kooperberg (2009) to give similar re- 
sults, and that the results would be similar to the 
four degree-of-freedom test of comparing 
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Table 1 

Comparison of P-values for testing gene x gene interactions 









No independence assumed 


Independence assumed 






Logistic 


Zhao et al. 


Rajapakse, Perlman 


Zhao et al. 


Rajapakse, Perlman 


Method 




regression 


(2006) 


and Kooperberg (2009) 


(2006) 


and Kooperberg (2009) 


TP53 x 


CD14 


0.0418 


0.0325 


0.0364 


0.1062 


0.0210 


TNFR1 


x APOC3 


0.0001 


0.0005 


0.0005 


0.0009 


0.0000 


TP53 x 


MDM2 


0.0123 


0.0715 


0.0767 


0.8607 


0.0836 



= p + p 1 I(X i = T)+l32l(X i = 2) 
+ P 3 I(X j = l)+p 4 I(X j = 2). 

In the implementation of Rajapakse, Perlman and 
Kooperberg (2009) these tests use the Kullback- 
Leibler distance between two matrices. In Table 1 we 
provide results on a previously analyzed case-control 
study consisting of 779 heart disease patients, 342 
of whom showed restenosis, and 437 who did not 
(Hoh et al., 2001; Kooperberg and Ruczinski, 2005). 
All individuals were genotyped for 89 SNPs/variants 
in 62 genes that were previously associated with 
heart disease. We show results for three of the two- 
SNP interactions that were identified in Table III 
of Kooperberg and Ruczinski (2005) (the other four 
interactions in this table involved a variant that had 
no homozygotic minor allele subjects). The signifi- 
cance level for the methods of Zhao et al. (2006) 
and Rajapakse, Perlman and Kooperberg (2009) are 
based on 10,000 permutations. We note that, as ex- 
pected, the three methods give similar results when 
there is no independence assumption. When we do 
make an independence assumption, the approach of 
Zhao et al. (2006) appears less powerful than the 
one of Rajapakse, Perlman and Kooperberg (2009), 
which does not require a haplotype reconstruction. 
The approach of Rajapakse, Perlman and Kooper- 
berg (2009) offers the additional advantage of po- 
tential for extension to tests of interaction effects 
between blocks of SNPs. 

3.4 Using Main Effects to Find 
Interactions in GWAS 

The methods discussed above exploit the genetic 
structure of the data to be analyzed. However, mostly 
those approaches do not scale up to GWAS, both 
because methods become computationally too de- 
manding, and because the number of multiple com- 
parisons becomes so large that the power to identify 
significant interactions for anything other than the 



strongest effects is missing. Interestingly, testing all 
models that include an interaction for the combined 
effect of a particular SNP on a disease outcome can 
increase the power to identify an individual SNP as 
being associated with a disease outcome (Marchini 
et al., 2005; Evans et al., 2006), but it does not in- 
crease the power to identify an interaction. We dis- 
cuss this approach further in Section 4. 

If only a few environmental factors are examined, 
the problem to identify simple multiplicative gene 
x environment interactions is essentially the same 
as studying marginal effects. Thus, while power is 
limited, just like for any GWAS study, computa- 
tionally studying gene x environment interactions 
is straightforward. However, the filtering procedures 
that we suggest below for gene x gene interactions 
can increase the power to identify gene x environ- 
ment interactions in GWAS as well. 

While it is probably obvious that enumerating all 
interactions in a GWAS will be computationally too 
expensive, except for the simplest possible models 
involving just 2-SNP interactions, adaptive search 
algorithms do not circumvent these problems. Adap- 
tive algorithms can be roughly divided in those us- 
ing stochastic search algorithms and greedy search 
algorithms. Stochastic search algorithms, like the 
simulated annealing algorithm used by Logic Re- 
gression, search a stochastically selected set of mod- 
els, selecting the best fitting model(s) among those 
examined. For well structured model classes these 
algorithms can avoid looking at many poorly fitting 
models. However, to have a reasonable opportunity 
to find good fitting models, the number of models 
examined needs to increase considerably. In particu- 
lar, since in humans the extent of linkage disequilib- 
rium is small compared to the length of the genome 
(r 2 is typically much smaller than 0.8 after less than 
50 kb; Pritchard and Przeworski, 2001), the number 
of models that are examined needs to go up with 
close to the number of SNPs, and likely even more 
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if higher order interactions are studied to find good 
interactions. Greedy search algorithms, for example, 
the stepwise and tree algorithms described in Sec- 
tion 2, are much more likely to end up in local opti- 
mal models, that are globally not very good, when 
the number of predictors increases. 

For these reasons, the most viable solutions to ex- 
tend the methods discussed above to GWAS would 
be to select SNPs based on some marginal criterion, 
and only search for interactions among the selected 
SNPs. It is clear that this approach reduces the com- 
putational requirements. The two main questions 
are, however, as follows: 

• does a filtering procedure alleviate the multiple 
comparisons problem? 

• are we able to identify the "important" interac- 
tions? 

As Marchini et al. (2005) point out, the multi- 
ple comparisons correction for testing interactions 
after marginal filtering needs to take the filtering 
into account. The "safe" approach is to correct the 
number of tests, for example, using a family-wide 
error rate (FWER) or a false discovery rate (FDR) 
approach for the number of interactions that could 
have been examined. Clearly, with such an approach 
the power to identify significant interactions cannot 
be larger than when all possible interactions would 
have been examined (but at reduced computational 
cost). This was part of what was found by Marchini 
et al. (2005) and Evans et al. (2006). We should 
note here that, besides that this is computationally 
infeasible, there is no simple permutation test for 
(the strongest) interaction effect, as a simple per- 
mutation of case-control status not only removes the 
interaction effect, but also removes all main effects. 
Such a permutation test would thus be a test of main 
effect combined with interaction — a topic which we 
discuss in Section 4. 

Kooperberg and LeBlanc (2008) establish that if 
the marginal testing of SNPs is carried out using 
regression models of the form 

7o + 7iA"i, 

for some coding Xi of a SNP i and the interaction 
model examined is of the form 

(3) Po + Ptfi + faXj + foXiXj, 

then the least squares estimates of 71 and /?3 are 
independent. While the estimation in case-control 



studies is typically carried out using logistic regres- 
sion, this independence result gives some justifica- 
tion of only correcting for the number of tests that 
actually were examined, for example, using a Bon- 
feroni approach. Kooperberg and LeBlanc (2008) 
also develop a resampling procedure based on scores, 
extending a technique proposed by Lin (2006), that 
offers an alternative to permutation tests, and is ap- 
plicable to two stage studies in which only SNPs that 
are marginally significant are tested for interactions. 
In this approach a sample from the efficient score for 
the interaction model (3) is generated under the null 
hypothesis that P3 = 0. 

In particular, the efficient score for the addition 
of an interaction XiXj, conditional on and Xj 
already being in the model, is Uij = ^2 k , where 
Uijk = (Y k - Pi jk )(X ik X jk - n ijk ). Here Y k is case- 
control status for subject k, pij k is the fitted proba- 
bility of subject k being a case in a logistic regression 
model using Xi and Xj, but not X^Xj, as predictor 
for Y, and fMij k is the fitted value for subject k in 
the linear regression model using Xi and Xj as pre- 
dictor for X{Xj. Under the null hypothesis of no as- 
sociation, Uij is approximately normal with mean 
and variance Vij = ^ k Uf- k . ^ n computing the sig- 
nificance level, we compare T = maxjj Ufj/Vij with 
T* = maxij(Y, k UijkZ k ) 2 /Vij, where the Z k are in- 
dependent standard normal random variables. This 
approach does not assume independence of the stage 
one and two tests, as the Bonferoni approach does, 
but rather the "permutations" are carried out con- 
ditional on the results of the first stage. 

A simple routine for computing power calculations 
for two stage tests of interaction is implemented in 
the CRAN package powerGWASinteraction. Kooper- 
berg and LeBlanc (2008) contain extensive simu- 
lation studies establishing that the two-stage ap- 
proach indeed maintains the correct type 1 error 
rate, that the power in most reasonable situations is 
vastly improved over a one-stage analysis, and that 
this power is well approximated by the routines from 
powerGWASinteraction. 

Clearly, this approach can also be applied to iden- 
tify gene x environment interactions: now only one 
SNP needs to be marginally significant to be tested 
as part for a gene x environment interaction. In 
Table 2 we present power calculations for identify- 
ing a gene x environment interaction. We assume 
model (3) with Xi a binary SNP with P(Xi = 1) = 
0.4375, which corresponds to a dominant SNP effect 



INTERACTIONS IN GWAS 



13 



for a SNP with minor allele frequency 0.25, a bi- 
nary environmental factor Xj with P(Xj = 1) = 0.5, 
a case-control study with 5000 cases and controls, 
500,000 SNPs, ft = -2 (not a rare disease), ft = 
(no genetic effect when Xj = 0), ft = 0.5 (a moder- 
ate environmental effect), and an overall multiple- 
comparisons controlled significance level a = 0.05. 
We show results for several gene x environment in- 
teraction effects ft, and several levels for the 
marginal level of significance a\ that a SNP has to 
satisfy before it is tested for the gene x environment 
interaction. Besides power to identify the interaction 
using a regular analysis, we also show power for an 
analysis that assumes that the gene and the envi- 
ronmental factor are independent. 

We see from Table 2 that a two-stage procedure 
increases the power considerably. If only the top 500 
SNPs are tested for gene x environment interactions 
{ct\ = 0.0001), the power to identify an interaction 
with odds ratio 1.4 is over 70%. If the gene and en- 
vironmental factor are assumed to be independent, 
and we analyze the data using, for example, a case- 
only analysis or the approach of Dai et al. (2009), 
the power increases to about 90%. 

We applied the two-stage approach to the WTCCC 
Crohn's disease data (WTCCC, 2007). We identified 
211 SNPs that had minor allele frequency > 0.1, less 
than 5% missing data and marginal significance level 
p < 0.0001, and did not grossly violate Hardy Wein- 
berg Equilibrium. Among the ( 2 ^ 1 ) = 22155 two- 
SNP interactions among these SNPs, three had a 
q- value < 0.05 Storey and Tibshirani (FDR, 2003), 
and are thus plausible. Two of these interactions 
involve SNPs on different chromosomes, the third 
one involves two SNPs relatively close together on 
the same chromosome. Nineteen more possible inter- 
actions have q- values < 0.25, suggesting that more 



than ten of those show some reproducible associa- 
tion with Crohn's disease. 

If the effect of a SNP goes in the opposite direction 
for two levels of another SNP or an environmental 
variable, a two-stage analysis may have less power 
than a one-stage analysis (if these opposite effects 
just cancel each other out). We believe that some 
of the more unusual interaction effects considered 
in Evans et al. (2006) are less likely, and we believe 
that using the power for more likely scenarios is a 
potentially more fruitful way of a-spending. 

4. USING INTERACTIONS TO FIND MAIN 
EFFECTS IN GWAS 

Modeling gene x environment or gene x gene in- 
teractions is useful even if the goal of the analysis 
is primarily the identification of simple gene-disease 
associations. For instance, if not acknowledged in 
the analysis method, interactions can lead to atten- 
uation of the marginal effect size and reduce the 
power to detect true associations. Several authors 
have incorporated interactions into their search of 
marginal genetic association. A common thread of 
successful methods is that they allow model flexibil- 
ity, but not so much model flexibility as to substan- 
tially increase variance. 

For instance, in the simple two variable model (1) 
in Section 2, one can test the overall association of 
X\ with disease outcome by testing the null hypoth- 
esis Hq : ft = ft = using a 2 degree of freedom 
test. Exploiting potential interactions to detect ge- 
netic association with this simple testing technique 
has been explored by Kraft et al. (2007). In a less 
directed fashion, Marchini et al. (2005) tested both 
the main effects and interactions to explain the 3 x 
3 table of two SNPs to assess individual associations. 



Table 2 

Sample power calculations for gene x environment interactions. The numbers 0.00001, . . . , 1 are the first stage significance ai 
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0.89 
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Extensions to multiple predictors can substantially 
increase the potential number of parameters. For in- 
stance, consider two sets of variables Xu and X$j, 
which could represent two sets of SNPs or SNPs 
and environmental variables. For assessing the asso- 
ciation of a given Xu with outcome, one can simul- 
taneously test the 1 + q terms (3u and (if there 
are multiple Xu, i = 1, . . . ,p, testing involves p + pq 
terms) in the model 

v 1 
r?(X) = A, + Pii x K + S fe X *i + Yl N X uX 2j , 

i=l j=l i,j 

where the components X 2 j could represent other 
SNPs or environmental factors. The difficulty is that 
as q increases, the potential power of the test may 
significantly decrease due to the increased number 
of parameters. One way to limit model complexity is 
to specify a restricted form for the interaction model 
such as 

v Q 
r?(X) =/3 + X^ ft**!* + E 

i=l j=l 
+ Plifi2jXiiX2j ■ 

This class of models is used by Chatterjee et al. 
(2006), and relates to the idea of a one-degree of free- 
dom interaction test, dating back to Tukey (1949). 
Suppose a gene is identified by p SNPs Xu, then a 
hypothesis of no association with outcome for this 
set Xu could be phrased as Hq : flu = 0, i = 1, . . . ,p, 
where this indicates no association through main 
effects or interactions since the /3u also appear in 
the interaction term. The strategy assesses overall 
variable importance of a gene (potentially repre- 
sented by several Xu) in the context of a more gen- 
eral model which could include interactions. For in- 
stance, one could use any regularized and or stepwise 
model building strategy (e.g., regression trees or re- 
gression splines described in Section 2, or ensembles 
of such models) and evaluate the impact of remov- 
ing the gene of interest from the model. One tech- 
nique to measure the importance of the variable is 
to evaluate the difference in the fit or log-likelihood 
compared to the fit with the gene permuted with 
respect to all other variables (e.g., Breiman, 2001). 
However, this last strategy is likely not computa- 
tionally feasible in the context of GWAS. 

An alternative idea is to modify simple gene as- 
sociation test statistics by weighting them to take 



advantage of an interaction and increase power of 
the test statistic. For instance, one could focus on 
subgroups of subjects to test for genetic association. 
In addition, if the procedure was computational ef- 
ficient, it could be an alternative to fitting full in- 
teractions with maximum likelihood methods. 

Since many useful association tests are score type 
statistics, one can outline the method in terms of 
weighted score test statistics. Let Z denote an en- 
vironment or treatment variable and Xj a genetic 
factor. For example, in the case of binary outcome 
data, let Xji be the gene j value and Z k i environ- 
mental factor k for individual i, the score component 
would be Uji = Xji(Yi - exp(a + /3Z ki ))/(l + (a + 
f3Zki)). If the association is thought to be stronger 
in a subgroup of subjects (e.g., heavier smokers) 
based on some environmental factors, then a sub- 
group weighted marginal test statistic 

n 

U w (Z,X j ;9) = ^2h(Z i ,e)U ji , 

i=l 

where h(Z; 9) = I{Z > 6} and Z is an ordered en- 
vironmental variable, may have more power. The 
generalization to multiple or alternative basis func- 
tions 

n q 

u w (z, x f ,e) = J2J2 <Xkh{Z ik ,e k )U 3i 

i=l k=l 

allows more flexibility. The basis functions, h(Zi k ,6 k ), 
could be simple subset functions, h(Zik;6) = I{Zi k > 
Cfc}, or piecewise linear functions, h{Z^;9) = {Zi k — 
Cfe} + , similar to those used in tree-based models and 
regression spline models described in Section 2. As 
the direction of association is usually unknown, the 
weights a k can be derived from the data. LeBlanc 
and Kooperberg (2009) obtain weights using stage- 
wise regression based on the Least Angle Regres- 
sion algorithm (LARS, Efron et al., 2004), where 
the score components are the outcome variable. The 
intent is to focus the test statistic on the environ- 
mental combination which leads to maximal genetic 
association. They show that weighting the associa- 
tion test statistics can significantly increase power 
in many simulated situations where gene x envi- 
ronment interactions exist. As an example, Figure 3 
shows simulations for 2000 cases and 2000 controls 
generated from the logistic interaction regression 
model 
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V (Z, X)=f3 + foX + foh(Z) + foXh{Z), 

where we assume a binary SNP, X with frequency 
equal to 0.2, and h(Z) a function of several envi- 
ronmental variables. There were five environmen- 
tal variables corresponding to the basis functions 
(1, {Zj < co.25}, {Zj < C0.50}, {Zj < c . 75 })j = 1, . . . , 5, 
available for modeling and h(Z) depended on the 
linear combination of two of those variables: h(Z) = 
\Z\ < co.50} — {Z2 < Co. 50} + 0.25. We evaluated a 
marginal test of the 1 + k, fc = 3 x 5 parameters 
(including main gene effect and the modifying vari- 
ables), and a regularized stage- wise test. The type I 
error was controlled to approximately 0.00001. The 
results are presented in Figure 3. The parameter val- 
ues main effects are f}\ = 0.05 and fa = so that 
the small genetic effect increases as h(Z) and (3^ in- 
crease. This would be a plausible scenario for effects 
in a GWAS, a genetic effect that is more apparent 
within a subgroup of subjects exposed to a set of 
environmental conditions. This model is similar to 
the hypothetical effects shown in the left panel of 
Figure 1. 

As the magnitude of the interaction effect increases, 
using the more complex model and joint association 
testing substantially increases power over marginal 
testing. The full model weighting, depending on 16 
basis functions and parameters, suffers somewhat 
from increased variance. However, the stage- wise test 
statistic performs the best of the three methods by 
controlling the overall variance. The tuning parame- 
ter for the stage-wise method was set to correspond 
to approximately 2.5 degrees of freedom. 

Therefore, if there are one or a small number of 
well characterized environmental factors that are sub- 
stantially modifying the association of a gene with 
disease, statistical strategies which incorporate in- 
teractions, and jointly test the main effect and the 
interaction, are useful for improving power over 
marginal association tests. 

While the adaptive weighting strategy is more com- 
putationally demanding than calculating traditional 
score test statistics, it is feasible to conduct the anal- 
ysis on the GWAS because each SNP calculation is 
independent and the Least Angle Regression algo- 
rithm using a small number of environmental vari- 
ables is very efficient, if the tuning parameter is set 
a priori as we suggest. However, the impact on vari- 
ance is potentially a greater concern, there is still 
likely be some power advantage of filtering on main 
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Fig. 3. Using interactions in tests of association: Power 
for marginal test, full 1 + k parameter association test, and 
stage-wise weighted test with 2000 cases and 2000 controls and 
a = 0.00001. The full 1 + k parameter test is based on testing 
the main effect and all interaction terms in a logistic regres- 
sion model and the stage-wise test is based on using weighted 
score test statistics where the weights are derived from the 
Least Angle Regression (LARS) algorithm. The main effect 
parameters in the data generating model are j3\ = 0.05 and 

02=0. 

effects to, say, a small number of 100-1000s, before 
applying the adaptive method analogous strategy 
described in Section 3. 

5. DISCUSSION 

Identifying interactions is typically not the main 
goal of a GWAS analysis. Interaction effects may 
teach us things about the biology behind a disease, 
or they may be beneficial in constructing predica- 
tion models. However, at least as important, inter- 
action effects or differences in the gene (SNP) effect 
between different subgroups may actually help us in 
identifying the significant SNPs. Therefore, we be- 
lieve that it is important to pay some attention to 
the identification of interactions. 

Because of the number of SNPs under consider- 
ation in a typical GWAS, it is virtually impossible 
to identify gene x gene interaction effects, unless 
additional assumptions are being made. We believe 
that the most fruitful approach is to first identify 
SNPs that are (marginally) associated with a dis- 
ease, and then examine interactions involving those 
SNPs. Not only does this seem reasonable because 
SNPs that have an interaction effect typically will 
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also show some modest main effect, it also adheres 
to a basic premise in statistical modeling which re- 
duces the variance in model building: don't model 
interactions without main effects. 

After such initial filtering, there is a substantial 
number of approaches that can be used to identify 
interactions that make use of the specific form of ge- 
netic data. This is a reasonable two-stage approach 
if the methods to identify interactions are used to 
independent data than what was used to identify the 
marginal significant SNPs. If the same data is used, 
however, care has to be taken that the initial selec- 
tion of SNPs does not bias the inference about the 
interactions. We have shown that for a simple model 
this is possible — but this is certainly not generally 
true. 

The story for gene x environment interactions is 
similar. The problem of identifying such interactions 
is "smaller," but it is still so large that some filtering 
will often increase the power. 

Exploiting the genetic structures and making ad- 
ditional assumptions, like gene x gene independence 
among genes on different chromosomes, among con- 
trols, or gene x environment independence, can sub- 
stantially increase the power to identify interactions. 
Clearly, however, if the assumptions are not true, 
making those assumptions can substantially increase 
the type 1 error. Thus, if those assumptions are un- 
certain, an empirical Bayes approach like the one 
by Mukherjee and Chatterjee (2008) (see Section 3) 
may be safer. 
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